@natoverse
Collaborator

  • Pulls input document processing into its own package
  • Revamps the factory to match our factory pattern
  • Adds jsonl and markitdown as input processors
  • Separates the storage config into its own block
  • Makes metadata handling entirely a chunking concern, independent of document ingest
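
As a rough illustration of what a JSONL input processor does (one JSON object per line, each becoming one document), here is a minimal sketch. It is not the code from this PR: the `Doc` class, the `read_jsonl` function, and the `text_key`/`title_key` parameters are hypothetical names chosen for the example.

```python
from __future__ import annotations

import json
from dataclasses import dataclass


@dataclass
class Doc:
    """Hypothetical stand-in for a loaded document (not the package's TextDocument)."""

    title: str
    text: str


def read_jsonl(content: str, text_key: str = "text", title_key: str = "title") -> list[Doc]:
    """Parse JSONL content: one JSON object per non-empty line, one Doc per object."""
    docs: list[Doc] = []
    for line_no, line in enumerate(content.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        # Fall back to a positional title when the record has none (an assumption).
        title = str(record.get(title_key, f"line-{line_no}"))
        docs.append(Doc(title=title, text=str(record[text_key])))
    return docs


sample = '{"title": "a", "text": "first"}\n\n{"text": "second"}\n'
docs = read_jsonl(sample)
```

Each line is parsed independently, so a reader like this can stream large files without loading one big JSON array.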

- Create new graphrag-input package with input loading utilities
- Move InputConfig, InputFileType, InputReader, TextDocument, and file readers (CSV, JSON, JSONL, Text)
- Add get_property utility for nested dictionary access with dot notation
- Include hashing utility for document ID generation
- Update all imports throughout codebase to use graphrag_input
- Add package to workspace configuration and release tasks
- Remove old graphrag.index.input module
- Rename chunk_result.py to text_chunk.py with ChunkResult -> TextChunk
- Add 'original' field to TextChunk to track pre-transform text
- Add optional transform callback to chunker.chunk() method
- Add add_metadata transformer for prepending metadata to chunks
- Update create_chunk_results to apply transforms and populate original
- Update sentence_chunker and token_chunker with transform support
- Refactor create_base_text_units to use new transformer pattern
- Rename pluck_metadata to get/collect methods on TextDocument
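
The `get_property` and hashing utilities listed above might look roughly like the sketch below. Only the name `get_property` and the dot-notation behavior come from the commit message; the signatures, the error handling, the `gen_document_id` helper, and the choice of SHA-512 are assumptions made for illustration.

```python
from __future__ import annotations

import hashlib
from typing import Any


def get_property(data: dict[str, Any], path: str) -> Any:
    """Walk a nested dict using a dot-separated path such as 'meta.author.name'."""
    current: Any = data
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            raise KeyError(f"Property '{path}' not found (failed at '{key}')")
        current = current[key]
    return current


def gen_document_id(record: dict[str, Any], fields: list[str]) -> str:
    """Hypothetical helper: derive a stable ID by hashing selected field values."""
    joined = "".join(str(get_property(record, field)) for field in fields)
    # SHA-512 is an assumption here; any stable digest would serve the same purpose.
    return hashlib.sha512(joined.encode("utf-8")).hexdigest()


record = {"text": "hello", "meta": {"author": {"name": "Ada"}}}
author = get_property(record, "meta.author.name")
doc_id = gen_document_id(record, ["text", "meta.author.name"])
```

Hashing the document's own content gives the same ID on every run, which keeps re-ingesting the same file idempotent.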
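
The chunking changes above (an `original` field on `TextChunk`, an optional transform callback, and an `add_metadata` transformer) can be sketched as follows. The names `TextChunk`, `original`, and `add_metadata` come from the commit message; the field set, the function signatures, and the fixed-width `chunk` function are simplifying assumptions, not the PR's implementation.

```python
from __future__ import annotations

from collections.abc import Callable
from dataclasses import dataclass


@dataclass
class TextChunk:
    original: str  # text before any transform was applied
    text: str      # text after the transform (same as original if none)
    index: int


def add_metadata(
    metadata: dict[str, str], delimiter: str = ": ", line_delimiter: str = "\n"
) -> Callable[[str], str]:
    """Build a transform that prepends 'key: value' lines to a chunk's text."""
    header = line_delimiter.join(f"{k}{delimiter}{v}" for k, v in metadata.items())

    def transform(text: str) -> str:
        return f"{header}{line_delimiter}{text}" if header else text

    return transform


def chunk(
    text: str, size: int, transform: Callable[[str], str] | None = None
) -> list[TextChunk]:
    """Toy fixed-width chunker (an assumption) that applies an optional transform."""
    pieces = [text[i : i + size] for i in range(0, len(text), size)]
    return [
        TextChunk(original=p, text=transform(p) if transform else p, index=i)
        for i, p in enumerate(pieces)
    ]


chunks = chunk("abcdefghij", size=4, transform=add_metadata({"title": "demo"}))
```

Keeping `original` alongside the transformed `text` means downstream steps can still recover the untouched source text even when metadata has been prepended.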
@natoverse requested a review from a team as a code owner on January 10, 2026, 00:43
@natoverse merged commit 710fdad into v3/main on Jan 12, 2026
14 checks passed
@natoverse deleted the input-factory branch on January 12, 2026, 20:48